Initial Setup
The data I am using in this assignment is the gapminder dataset.
# Load the packages needed
# install.packages("prettydoc")
suppressPackageStartupMessages(library(prettydoc))
suppressPackageStartupMessages(library(tidyverse))
suppressPackageStartupMessages(library(gapminder))
suppressPackageStartupMessages(library(forcats))
suppressPackageStartupMessages(library(ggthemes))
suppressPackageStartupMessages(library(kableExtra))
suppressPackageStartupMessages(library(gridExtra))
suppressPackageStartupMessages(library(grid))
suppressPackageStartupMessages(library(scales))
suppressPackageStartupMessages(library(plotly))
# Source: https://github.com/dgrtwo/gganimate
# install.packages("cowplot") # a gganimate dependency
# devtools::install_github("dgrtwo/gganimate")
suppressPackageStartupMessages(library(gganimate))
Part 1: Factor Management
Drop Oceania
Task Description: Filter the Gapminder data to remove observations associated with the continent of Oceania. Additionally, remove unused factor levels. Provide concrete information on the data before and after removing these rows and Oceania; address the number of rows and the levels of the affected factors.
Before making any changes to the dataset, let us review the dimensions and structure of the original dataset. The gapminder dataset has 1704 rows and 6 columns, of which 24 observations belong to the continent Oceania.
# Review on the dimension and structure of the original dataset
dim(gapminder)
## [1] 1704 6
# check the continent counts
continent_tbl <- as.data.frame(table(gapminder$continent))
# make the table of the continent counts
continent_tbl%>%
kable("html", caption = "Continent Counts",col.names = c("Continent", "Counts")) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"),full_width = F)%>%
column_spec(1, bold = T, border_right = T) %>%
column_spec(2, width = "20em")

| Continent | Counts |
|---|---|
| Africa | 624 |
| Americas | 300 |
| Asia | 396 |
| Europe | 360 |
| Oceania | 24 |
After checking the original dataset, I first filtered it to remove the observations associated with the continent of Oceania. The function levels() provides access to the levels attribute of a variable. Using this function, we can see that the level “Oceania” is still present in the filtered data, which means that removing the observations alone does not drop the level from the factor.
The function droplevels() is used to drop unused levels from a factor or, more commonly, from factors in a data frame. Here I use this function to drop the level “Oceania” which is no longer in the filtered dataset.
# Filter the data to remove the observations from Oceania
new_dat <- gapminder %>%
filter(continent!="Oceania")
# access the levels attribute of the variable continent
levels(new_dat$continent)
## [1] "Africa" "Americas" "Asia" "Europe" "Oceania"
# manually drop the level
new_dat_drop <- new_dat %>%
droplevels()
# access the levels attribute of the variable continent after dropping unused levels
levels(new_dat_drop$continent)
## [1] "Africa" "Americas" "Asia" "Europe"
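The continent counts for the two cases show that removing the observations associated with one level only sets the count for that level to zero; the level itself is kept until droplevels() is called. In both cases the data has 1680 rows (1704 minus the 24 Oceania rows), so droplevels() changes the factor levels (five before, four after) but not the number of rows.

Continent counts after filtering, before dropping unused levels:

| Continent | Counts |
|---|---|
| Africa | 624 |
| Americas | 300 |
| Asia | 396 |
| Europe | 360 |
| Oceania | 0 |

Continent counts after droplevels():

| Continent | Counts |
|---|---|
| Africa | 624 |
| Americas | 300 |
| Asia | 396 |
| Europe | 360 |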
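The behaviour described above can be reproduced on a small scale. The following is a minimal base R sketch, assuming a made-up data frame `toy` (hypothetical values, not taken from gapminder): filtering leaves the unused level in place with a count of zero, and droplevels() removes it without touching the rows.

```r
# Toy stand-in for gapminder; names and values are hypothetical
toy <- data.frame(
  continent = factor(c("Africa", "Asia", "Oceania", "Asia", "Oceania")),
  value = 1:5
)

# Remove the Oceania rows only
kept <- toy[toy$continent != "Oceania", ]
n_kept <- nrow(kept)                 # rows remaining after the filter
levels_kept <- levels(kept$continent) # "Oceania" is still listed as a level
counts_kept <- table(kept$continent)  # Oceania appears with a count of 0

# Drop the unused level; the rows are unchanged
dropped <- droplevels(kept)
levels_dropped <- levels(dropped$continent)
n_dropped <- nrow(dropped)

print(counts_kept)
print(table(dropped$continent))
```

Comparing `n_kept` with `n_dropped` and `levels_kept` with `levels_dropped` gives the same before/after information as the two tables above.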
Reorder the levels of country or continent.
Use the forcats package to change the order of the factor levels, based on a principled summary of one of the quantitative variables. Consider experimenting with a summary statistic beyond the most basic choice of the median.
I will reorder the levels of country by the maximum gdpPercap of each country.
# reorder the levels of country by the maximum `gdpPercap` of each country.
country_reorder <- gapminder$country %>%
fct_reorder(gapminder$gdpPercap, max)
# levels after reordering
head(levels(country_reorder))
## [1] "Burundi" "Ethiopia" "Malawi" "Zimbabwe" "Liberia"
## [6] "Mozambique"
# comparing with levels of original dataset
head(levels(gapminder$country))
## [1] "Afghanistan" "Albania" "Algeria" "Angola" "Argentina"
## [6] "Australia"
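To see how the choice of summary function drives the resulting order, here is a small sketch on made-up data (the vectors `country` and `gdp` are hypothetical and assume the forcats package is installed). fct_reorder() sorts the levels in ascending order of the summary, and its default summary is the median.

```r
library(forcats)

# Hypothetical values: two observations per "country"
country <- factor(c("A", "A", "B", "B", "C", "C"))
gdp     <- c(10, 50, 30, 20, 5, 40)

# Order levels by each country's maximum: A = 50, B = 30, C = 40
by_max <- fct_reorder(country, gdp, .fun = max)

# Order levels by the default summary (median): A = 30, B = 25, C = 22.5
by_median <- fct_reorder(country, gdp)

levels(by_max)
levels(by_median)
```

With these values the two summaries give different level orders, which is why the summary statistic should be chosen deliberately rather than left at the default.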
Explore the effects of arrange()
To explore the effects of arrange(), I first create a smaller subset of the gapminder dataset: five randomly sampled European countries. I then apply fct_reorder() and arrange() separately to this subset. The resulting tables below show the difference between the two: fct_reorder() changes only the order of the factor levels and leaves the rows where they are (the second table is identical to the first), while arrange() sorts the observations themselves according to our specification.
# Get observations from continent `Europe` and randomly select 5 countries
sub_dat <- gapminder %>%
filter(continent == "Europe")
# randomly sample 5 countries from the continent `Europe`
set.seed(0)
spl_id <- sample(unique(sub_dat$country), 5)
sub_dat <- sub_dat %>% filter(country %in% spl_id)
# reordering by gdpPercap
reorder_dat <- sub_dat %>%
mutate(country = fct_reorder(country, gdpPercap, .desc = TRUE))
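# arranging by gdpPercap; arrange() ignores the grouping unless .by_group = TRUE,
# so the rows are sorted across all five countries
arrange_dat <- sub_dat %>%
group_by(country) %>%
arrange(gdpPercap)
# arrange(desc(gdpPercap))
# make the table of original subdata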
sub_dat %>%
kable("html", caption = "Table of the newly created sub-dataset",
col.names = c("Country", "Continent", "Year", "Life Expectancy", "Population", "GDP per capita")) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"),full_width = F)%>%
column_spec(1, width = "10em", border_right = T) %>%
column_spec(2, width = "10em") %>%
scroll_box(width = "900px", height = "400px")

| Country | Continent | Year | Life Expectancy | Population | GDP per capita |
|---|---|---|---|---|---|
| Denmark | Europe | 1952 | 70.780 | 4334000 | 9692.385 |
| Denmark | Europe | 1957 | 71.810 | 4487831 | 11099.659 |
| Denmark | Europe | 1962 | 72.350 | 4646899 | 13583.314 |
| Denmark | Europe | 1967 | 72.960 | 4838800 | 15937.211 |
| Denmark | Europe | 1972 | 73.470 | 4991596 | 18866.207 |
| Denmark | Europe | 1977 | 74.690 | 5088419 | 20422.901 |
| Denmark | Europe | 1982 | 74.630 | 5117810 | 21688.040 |
| Denmark | Europe | 1987 | 74.800 | 5127024 | 25116.176 |
| Denmark | Europe | 1992 | 75.330 | 5171393 | 26406.740 |
| Denmark | Europe | 1997 | 76.110 | 5283663 | 29804.346 |
| Denmark | Europe | 2002 | 77.180 | 5374693 | 32166.500 |
| Denmark | Europe | 2007 | 78.332 | 5468120 | 35278.419 |
| Germany | Europe | 1952 | 67.500 | 69145952 | 7144.114 |
| Germany | Europe | 1957 | 69.100 | 71019069 | 10187.827 |
| Germany | Europe | 1962 | 70.300 | 73739117 | 12902.463 |
| Germany | Europe | 1967 | 70.800 | 76368453 | 14745.626 |
| Germany | Europe | 1972 | 71.000 | 78717088 | 18016.180 |
| Germany | Europe | 1977 | 72.500 | 78160773 | 20512.921 |
| Germany | Europe | 1982 | 73.800 | 78335266 | 22031.533 |
| Germany | Europe | 1987 | 74.847 | 77718298 | 24639.186 |
| Germany | Europe | 1992 | 76.070 | 80597764 | 26505.303 |
| Germany | Europe | 1997 | 77.340 | 82011073 | 27788.884 |
| Germany | Europe | 2002 | 78.670 | 82350671 | 30035.802 |
| Germany | Europe | 2007 | 79.406 | 82400996 | 32170.374 |
| Italy | Europe | 1952 | 65.940 | 47666000 | 4931.404 |
| Italy | Europe | 1957 | 67.810 | 49182000 | 6248.656 |
| Italy | Europe | 1962 | 69.240 | 50843200 | 8243.582 |
| Italy | Europe | 1967 | 71.060 | 52667100 | 10022.401 |
| Italy | Europe | 1972 | 72.190 | 54365564 | 12269.274 |
| Italy | Europe | 1977 | 73.480 | 56059245 | 14255.985 |
| Italy | Europe | 1982 | 74.980 | 56535636 | 16537.483 |
| Italy | Europe | 1987 | 76.420 | 56729703 | 19207.235 |
| Italy | Europe | 1992 | 77.440 | 56840847 | 22013.645 |
| Italy | Europe | 1997 | 78.820 | 57479469 | 24675.024 |
| Italy | Europe | 2002 | 80.240 | 57926999 | 27968.098 |
| Italy | Europe | 2007 | 80.546 | 58147733 | 28569.720 |
| Slovak Republic | Europe | 1952 | 64.360 | 3558137 | 5074.659 |
| Slovak Republic | Europe | 1957 | 67.450 | 3844277 | 6093.263 |
| Slovak Republic | Europe | 1962 | 70.330 | 4237384 | 7481.108 |
| Slovak Republic | Europe | 1967 | 70.980 | 4442238 | 8412.902 |
| Slovak Republic | Europe | 1972 | 70.350 | 4593433 | 9674.168 |
| Slovak Republic | Europe | 1977 | 70.450 | 4827803 | 10922.664 |
| Slovak Republic | Europe | 1982 | 70.800 | 5048043 | 11348.546 |
| Slovak Republic | Europe | 1987 | 71.080 | 5199318 | 12037.268 |
| Slovak Republic | Europe | 1992 | 71.380 | 5302888 | 9498.468 |
| Slovak Republic | Europe | 1997 | 72.710 | 5383010 | 12126.231 |
| Slovak Republic | Europe | 2002 | 73.800 | 5410052 | 13638.778 |
| Slovak Republic | Europe | 2007 | 74.663 | 5447502 | 18678.314 |
| Sweden | Europe | 1952 | 71.860 | 7124673 | 8527.845 |
| Sweden | Europe | 1957 | 72.490 | 7363802 | 9911.878 |
| Sweden | Europe | 1962 | 73.370 | 7561588 | 12329.442 |
| Sweden | Europe | 1967 | 74.160 | 7867931 | 15258.297 |
| Sweden | Europe | 1972 | 74.720 | 8122293 | 17832.025 |
| Sweden | Europe | 1977 | 75.440 | 8251648 | 18855.725 |
| Sweden | Europe | 1982 | 76.420 | 8325260 | 20667.381 |
| Sweden | Europe | 1987 | 77.190 | 8421403 | 23586.929 |
| Sweden | Europe | 1992 | 78.160 | 8718867 | 23880.017 |
| Sweden | Europe | 1997 | 79.390 | 8897619 | 25266.595 |
| Sweden | Europe | 2002 | 80.040 | 8954175 | 29341.631 |
| Sweden | Europe | 2007 | 80.884 | 9031088 | 33859.748 |
# make the result table of reordering
reorder_dat %>%
kable("html", caption = "Result Table after reordering by `gdpPercap`",
col.names = c("Country", "Continent", "Year", "Life Expectancy", "Population", "GDP per capita")) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"),full_width = F)%>%
column_spec(1, width = "10em", border_right = T) %>%
column_spec(2, width = "10em") %>%
scroll_box(width = "900px", height = "400px")

| Country | Continent | Year | Life Expectancy | Population | GDP per capita |
|---|---|---|---|---|---|
| Denmark | Europe | 1952 | 70.780 | 4334000 | 9692.385 |
| Denmark | Europe | 1957 | 71.810 | 4487831 | 11099.659 |
| Denmark | Europe | 1962 | 72.350 | 4646899 | 13583.314 |
| Denmark | Europe | 1967 | 72.960 | 4838800 | 15937.211 |
| Denmark | Europe | 1972 | 73.470 | 4991596 | 18866.207 |
| Denmark | Europe | 1977 | 74.690 | 5088419 | 20422.901 |
| Denmark | Europe | 1982 | 74.630 | 5117810 | 21688.040 |
| Denmark | Europe | 1987 | 74.800 | 5127024 | 25116.176 |
| Denmark | Europe | 1992 | 75.330 | 5171393 | 26406.740 |
| Denmark | Europe | 1997 | 76.110 | 5283663 | 29804.346 |
| Denmark | Europe | 2002 | 77.180 | 5374693 | 32166.500 |
| Denmark | Europe | 2007 | 78.332 | 5468120 | 35278.419 |
| Germany | Europe | 1952 | 67.500 | 69145952 | 7144.114 |
| Germany | Europe | 1957 | 69.100 | 71019069 | 10187.827 |
| Germany | Europe | 1962 | 70.300 | 73739117 | 12902.463 |
| Germany | Europe | 1967 | 70.800 | 76368453 | 14745.626 |
| Germany | Europe | 1972 | 71.000 | 78717088 | 18016.180 |
| Germany | Europe | 1977 | 72.500 | 78160773 | 20512.921 |
| Germany | Europe | 1982 | 73.800 | 78335266 | 22031.533 |
| Germany | Europe | 1987 | 74.847 | 77718298 | 24639.186 |
| Germany | Europe | 1992 | 76.070 | 80597764 | 26505.303 |
| Germany | Europe | 1997 | 77.340 | 82011073 | 27788.884 |
| Germany | Europe | 2002 | 78.670 | 82350671 | 30035.802 |
| Germany | Europe | 2007 | 79.406 | 82400996 | 32170.374 |
| Italy | Europe | 1952 | 65.940 | 47666000 | 4931.404 |
| Italy | Europe | 1957 | 67.810 | 49182000 | 6248.656 |
| Italy | Europe | 1962 | 69.240 | 50843200 | 8243.582 |
| Italy | Europe | 1967 | 71.060 | 52667100 | 10022.401 |
| Italy | Europe | 1972 | 72.190 | 54365564 | 12269.274 |
| Italy | Europe | 1977 | 73.480 | 56059245 | 14255.985 |
| Italy | Europe | 1982 | 74.980 | 56535636 | 16537.483 |
| Italy | Europe | 1987 | 76.420 | 56729703 | 19207.235 |
| Italy | Europe | 1992 | 77.440 | 56840847 | 22013.645 |
| Italy | Europe | 1997 | 78.820 | 57479469 | 24675.024 |
| Italy | Europe | 2002 | 80.240 | 57926999 | 27968.098 |
| Italy | Europe | 2007 | 80.546 | 58147733 | 28569.720 |
| Slovak Republic | Europe | 1952 | 64.360 | 3558137 | 5074.659 |
| Slovak Republic | Europe | 1957 | 67.450 | 3844277 | 6093.263 |
| Slovak Republic | Europe | 1962 | 70.330 | 4237384 | 7481.108 |
| Slovak Republic | Europe | 1967 | 70.980 | 4442238 | 8412.902 |
| Slovak Republic | Europe | 1972 | 70.350 | 4593433 | 9674.168 |
| Slovak Republic | Europe | 1977 | 70.450 | 4827803 | 10922.664 |
| Slovak Republic | Europe | 1982 | 70.800 | 5048043 | 11348.546 |
| Slovak Republic | Europe | 1987 | 71.080 | 5199318 | 12037.268 |
| Slovak Republic | Europe | 1992 | 71.380 | 5302888 | 9498.468 |
| Slovak Republic | Europe | 1997 | 72.710 | 5383010 | 12126.231 |
| Slovak Republic | Europe | 2002 | 73.800 | 5410052 | 13638.778 |
| Slovak Republic | Europe | 2007 | 74.663 | 5447502 | 18678.314 |
| Sweden | Europe | 1952 | 71.860 | 7124673 | 8527.845 |
| Sweden | Europe | 1957 | 72.490 | 7363802 | 9911.878 |
| Sweden | Europe | 1962 | 73.370 | 7561588 | 12329.442 |
| Sweden | Europe | 1967 | 74.160 | 7867931 | 15258.297 |
| Sweden | Europe | 1972 | 74.720 | 8122293 | 17832.025 |
| Sweden | Europe | 1977 | 75.440 | 8251648 | 18855.725 |
| Sweden | Europe | 1982 | 76.420 | 8325260 | 20667.381 |
| Sweden | Europe | 1987 | 77.190 | 8421403 | 23586.929 |
| Sweden | Europe | 1992 | 78.160 | 8718867 | 23880.017 |
| Sweden | Europe | 1997 | 79.390 | 8897619 | 25266.595 |
| Sweden | Europe | 2002 | 80.040 | 8954175 | 29341.631 |
| Sweden | Europe | 2007 | 80.884 | 9031088 | 33859.748 |
# make the result table of arranging
arrange_dat %>%
kable("html", caption = "Result Table after arranging by `gdpPercap`",
col.names = c("Country", "Continent", "Year", "Life Expectancy", "Population", "GDP per capita")) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"),full_width = F)%>%
column_spec(1, width = "10em", border_right = T) %>%
column_spec(2, width = "10em")%>%
scroll_box(width = "900px", height = "400px")

| Country | Continent | Year | Life Expectancy | Population | GDP per capita |
|---|---|---|---|---|---|
| Italy | Europe | 1952 | 65.940 | 47666000 | 4931.404 |
| Slovak Republic | Europe | 1952 | 64.360 | 3558137 | 5074.659 |
| Slovak Republic | Europe | 1957 | 67.450 | 3844277 | 6093.263 |
| Italy | Europe | 1957 | 67.810 | 49182000 | 6248.656 |
| Germany | Europe | 1952 | 67.500 | 69145952 | 7144.114 |
| Slovak Republic | Europe | 1962 | 70.330 | 4237384 | 7481.108 |
| Italy | Europe | 1962 | 69.240 | 50843200 | 8243.582 |
| Slovak Republic | Europe | 1967 | 70.980 | 4442238 | 8412.902 |
| Sweden | Europe | 1952 | 71.860 | 7124673 | 8527.845 |
| Slovak Republic | Europe | 1992 | 71.380 | 5302888 | 9498.468 |
| Slovak Republic | Europe | 1972 | 70.350 | 4593433 | 9674.168 |
| Denmark | Europe | 1952 | 70.780 | 4334000 | 9692.385 |
| Sweden | Europe | 1957 | 72.490 | 7363802 | 9911.878 |
| Italy | Europe | 1967 | 71.060 | 52667100 | 10022.401 |
| Germany | Europe | 1957 | 69.100 | 71019069 | 10187.827 |
| Slovak Republic | Europe | 1977 | 70.450 | 4827803 | 10922.664 |
| Denmark | Europe | 1957 | 71.810 | 4487831 | 11099.659 |
| Slovak Republic | Europe | 1982 | 70.800 | 5048043 | 11348.546 |
| Slovak Republic | Europe | 1987 | 71.080 | 5199318 | 12037.268 |
| Slovak Republic | Europe | 1997 | 72.710 | 5383010 | 12126.231 |
| Italy | Europe | 1972 | 72.190 | 54365564 | 12269.274 |
| Sweden | Europe | 1962 | 73.370 | 7561588 | 12329.442 |
| Germany | Europe | 1962 | 70.300 | 73739117 | 12902.463 |
| Denmark | Europe | 1962 | 72.350 | 4646899 | 13583.314 |
| Slovak Republic | Europe | 2002 | 73.800 | 5410052 | 13638.778 |
| Italy | Europe | 1977 | 73.480 | 56059245 | 14255.985 |
| Germany | Europe | 1967 | 70.800 | 76368453 | 14745.626 |
| Sweden | Europe | 1967 | 74.160 | 7867931 | 15258.297 |
| Denmark | Europe | 1967 | 72.960 | 4838800 | 15937.211 |
| Italy | Europe | 1982 | 74.980 | 56535636 | 16537.483 |
| Sweden | Europe | 1972 | 74.720 | 8122293 | 17832.025 |
| Germany | Europe | 1972 | 71.000 | 78717088 | 18016.180 |
| Slovak Republic | Europe | 2007 | 74.663 | 5447502 | 18678.314 |
| Sweden | Europe | 1977 | 75.440 | 8251648 | 18855.725 |
| Denmark | Europe | 1972 | 73.470 | 4991596 | 18866.207 |
| Italy | Europe | 1987 | 76.420 | 56729703 | 19207.235 |
| Denmark | Europe | 1977 | 74.690 | 5088419 | 20422.901 |
| Germany | Europe | 1977 | 72.500 | 78160773 | 20512.921 |
| Sweden | Europe | 1982 | 76.420 | 8325260 | 20667.381 |
| Denmark | Europe | 1982 | 74.630 | 5117810 | 21688.040 |
| Italy | Europe | 1992 | 77.440 | 56840847 | 22013.645 |
| Germany | Europe | 1982 | 73.800 | 78335266 | 22031.533 |
| Sweden | Europe | 1987 | 77.190 | 8421403 | 23586.929 |
| Sweden | Europe | 1992 | 78.160 | 8718867 | 23880.017 |
| Germany | Europe | 1987 | 74.847 | 77718298 | 24639.186 |
| Italy | Europe | 1997 | 78.820 | 57479469 | 24675.024 |
| Denmark | Europe | 1987 | 74.800 | 5127024 | 25116.176 |
| Sweden | Europe | 1997 | 79.390 | 8897619 | 25266.595 |
| Denmark | Europe | 1992 | 75.330 | 5171393 | 26406.740 |
| Germany | Europe | 1992 | 76.070 | 80597764 | 26505.303 |
| Germany | Europe | 1997 | 77.340 | 82011073 | 27788.884 |
| Italy | Europe | 2002 | 80.240 | 57926999 | 27968.098 |
| Italy | Europe | 2007 | 80.546 | 58147733 | 28569.720 |
| Sweden | Europe | 2002 | 80.040 | 8954175 | 29341.631 |
| Denmark | Europe | 1997 | 76.110 | 5283663 | 29804.346 |
| Germany | Europe | 2002 | 78.670 | 82350671 | 30035.802 |
| Denmark | Europe | 2002 | 77.180 | 5374693 | 32166.500 |
| Germany | Europe | 2007 | 79.406 | 82400996 | 32170.374 |
| Sweden | Europe | 2007 | 80.884 | 9031088 | 33859.748 |
| Denmark | Europe | 2007 | 78.332 | 5468120 | 35278.419 |
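The contrast shown in the three tables can be condensed into a short sketch on made-up data (the tibble `toy` is hypothetical; dplyr and forcats are assumed to be installed). It also illustrates why the arranged table is sorted across countries even though the data was grouped: arrange() ignores groups unless `.by_group = TRUE` is set.

```r
library(dplyr)
library(forcats)

# Hypothetical data: two observations for each of two "countries"
toy <- tibble(
  country = factor(c("A", "A", "B", "B")),
  gdp     = c(30, 10, 20, 40)
)

# fct_reorder(): the levels change (medians: A = 20, B = 30), the rows do not
re <- toy %>% mutate(country = fct_reorder(country, gdp, .desc = TRUE))

# arrange(): the rows change, the levels do not; grouping is ignored by default
ar <- toy %>% group_by(country) %>% arrange(gdp)

# With .by_group = TRUE the rows are sorted within each country instead
ar_grp <- toy %>% group_by(country) %>% arrange(gdp, .by_group = TRUE)

levels(re$country)
re$gdp
levels(ar$country)
ar$gdp
ar_grp$gdp
```

Whether sorting within groups is wanted depends on the question; the analysis above sorts across all countries, which matches the default behaviour.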
After evaluating the difference between the two functions, I plot the three resulting datasets to see how the effects of each function are reflected in the data visualization.
# plot the three cases and make them side by side
plot1 <- sub_dat %>%
ggplot(aes(year, gdpPercap, colour = country))+
geom_point()+
geom_line()+
theme_bw()+
ggtitle("Plot for gdpPercap per year \n - original")
plot2 = reorder_dat %>%
ggplot(aes(year, gdpPercap, colour = country))+
geom_point()+
geom_line()+
theme_bw()+
ggtitle("Plot for gdpPercap per year \n - reordering")
plot3 = arrange_dat %>%
ggplot(aes(year, gdpPercap, colour = country))+
geom_point()+
geom_line()+
theme_bw()+
ggtitle("Plot for gdpPercap per year \n - arranging")
grid.arrange(plot1,plot2,plot3,ncol = 3)
From the plots we can see that, although arrange() changes the order of the observations, it has no effect on the plot: each country has the same color as in the plot of the original sub-dataset. In contrast, the colors assigned to the countries change in the reordered plot, which means that fct_reorder() does affect the plot. (In the middle plot the countries are now listed in descending order of median gdpPercap.)
Part 2: File I/O
Task Description: Experiment with write_csv()/ read_csv() , saveRDS()/ readRDS(). Create something new, probably by filtering or grouped-summarization of Singer or Gapminder. Fiddle with the factor levels, i.e. make them non-alphabetical. Explore whether this survives the round trip of writing to file then reading back in.
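For this part, I will first reorder the levels of country as in Part 1, but this time by the maximum population (descending) for the countries in Europe. Note that after the reordering, the levels of country are no longer listed alphabetically. The levels before reordering still include non-European countries such as Afghanistan because the unused levels were not dropped after filtering.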
# get the observations in Europe
gap_Europe <- gapminder %>%
filter(continent == "Europe")
# reorder the newly created data
gap_Europe_reorder<- gap_Europe %>%
mutate(country = fct_reorder(country, pop, max, .desc = TRUE))
# first a few levels after reordering
head(levels(gap_Europe_reorder$country))
## [1] "Germany" "Turkey" "France" "United Kingdom"
## [5] "Italy" "Spain"
# comparing to the levels before reordering
head(levels(gap_Europe$country))
## [1] "Afghanistan" "Albania" "Algeria" "Angola" "Argentina"
## [6] "Australia"
Then I will write the dataset to a csv file using write_csv() and read it back using read_csv(). I will also do the same with saveRDS() and readRDS() to see the difference between these two pairs of writing and reading functions. The tables for the original dataset and the two imported datasets look the same. However, after the write_csv()/read_csv() round trip, country is a character vector instead of a factor. To check the levels, I convert country back into a factor with as.factor(), but the resulting levels are alphabetical, so the imported csv data does not retain the reordered country levels of the original dataset. In contrast, saveRDS()/readRDS() preserve the attributes of the variables and therefore keep the reordered country levels.
# write the reordered dataset into csv
write_csv(gap_Europe_reorder, "gap_Europe_reorder.csv")
# write the reordered dataset into rds
saveRDS(gap_Europe_reorder, "gap_Europe_reorder.rds")
# read the newly created csv file
import_csv = read_csv("gap_Europe_reorder.csv")
# read the newly created rds file
import_rds = readRDS("gap_Europe_reorder.rds")
# make tables for the original dataset and import datasets
head(gap_Europe_reorder)%>%
kable("html", caption = "First Parts of the Reordered Observations in original `gap_Europe_reorder`",col.names = c("Country", "Continent", "Year", "Life Expectancy", "Population", "GDP per capita")) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"),full_width = F)%>%
column_spec(1, bold = T, border_right = T) %>%
column_spec(2, width = "10em")

| Country | Continent | Year | Life Expectancy | Population | GDP per capita |
|---|---|---|---|---|---|
| Albania | Europe | 1952 | 55.23 | 1282697 | 1601.056 |
| Albania | Europe | 1957 | 59.28 | 1476505 | 1942.284 |
| Albania | Europe | 1962 | 64.82 | 1728137 | 2312.889 |
| Albania | Europe | 1967 | 66.22 | 1984060 | 2760.197 |
| Albania | Europe | 1972 | 67.69 | 2263554 | 3313.422 |
| Albania | Europe | 1977 | 68.93 | 2509048 | 3533.004 |
head(import_csv) %>%
kable("html", caption = "First Parts of the Reordered Observations by reading `gap_Europe_reorder.csv` ",col.names = c("Country", "Continent", "Year", "Life Expectancy", "Population", "GDP per capita")) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"),full_width = F)%>%
column_spec(1, bold = T, border_right = T) %>%
column_spec(2, width = "10em")

| Country | Continent | Year | Life Expectancy | Population | GDP per capita |
|---|---|---|---|---|---|
| Albania | Europe | 1952 | 55.23 | 1282697 | 1601.056 |
| Albania | Europe | 1957 | 59.28 | 1476505 | 1942.284 |
| Albania | Europe | 1962 | 64.82 | 1728137 | 2312.889 |
| Albania | Europe | 1967 | 66.22 | 1984060 | 2760.197 |
| Albania | Europe | 1972 | 67.69 | 2263554 | 3313.422 |
| Albania | Europe | 1977 | 68.93 | 2509048 | 3533.004 |
head(import_rds) %>%
kable("html", caption = "First Parts of the Reordered Observations by reading `gap_Europe_reorder.rds` ",col.names = c("Country", "Continent", "Year", "Life Expectancy", "Population", "GDP per capita")) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"),full_width = F)%>%
column_spec(1, bold = T, border_right = T) %>%
column_spec(2, width = "10em")

| Country | Continent | Year | Life Expectancy | Population | GDP per capita |
|---|---|---|---|---|---|
| Albania | Europe | 1952 | 55.23 | 1282697 | 1601.056 |
| Albania | Europe | 1957 | 59.28 | 1476505 | 1942.284 |
| Albania | Europe | 1962 | 64.82 | 1728137 | 2312.889 |
| Albania | Europe | 1967 | 66.22 | 1984060 | 2760.197 |
| Albania | Europe | 1972 | 67.69 | 2263554 | 3313.422 |
| Albania | Europe | 1977 | 68.93 | 2509048 | 3533.004 |
# Check the levels for original dataset
head(levels(gap_Europe_reorder$country))
## [1] "Germany" "Turkey" "France" "United Kingdom"
## [5] "Italy" "Spain"
# check variable `country`'s attribute
class(import_csv$country)
## [1] "character"
# Check the levels for import csv file
head(levels(as.factor(import_csv$country)))
## [1] "Albania" "Austria"
## [3] "Belgium" "Bosnia and Herzegovina"
## [5] "Bulgaria" "Croatia"
# check the levels for import rds file
head(levels(import_rds$country))
## [1] "Germany" "Turkey" "France" "United Kingdom"
## [5] "Italy" "Spain"
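If the level order needs to survive a csv round trip, one option is to re-apply the levels after reading. The sketch below is a minimal illustration under stated assumptions: it uses base R's write.csv()/read.csv() as stand-ins for write_csv()/read_csv(), temporary files instead of the files written above, and a made-up data frame `df`.

```r
# Hypothetical data with a non-alphabetical level order
df <- data.frame(
  key   = factor(c("b", "a", "c"), levels = c("c", "b", "a")),
  value = 1:3
)

csv_path <- tempfile(fileext = ".csv")
rds_path <- tempfile(fileext = ".rds")

write.csv(df, csv_path, row.names = FALSE)
saveRDS(df, rds_path)

from_csv <- read.csv(csv_path, stringsAsFactors = FALSE)
from_rds <- readRDS(rds_path)

# The csv round trip returns plain text; the rds round trip keeps the factor
csv_class  <- class(from_csv$key)
rds_levels <- levels(from_rds$key)

# Restore the original order by supplying the levels explicitly
from_csv$key <- factor(from_csv$key, levels = levels(df$key))
restored_levels <- levels(from_csv$key)

print(csv_class)
print(rds_levels)
print(restored_levels)
```

This only works when the intended level order is still available in the session or stored somewhere alongside the csv file, which is the practical argument for the rds format when factor structure matters.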
Part 3: Visualization Design
Task Description: Remake at least one figure or create a new one, in light of something you learned in the recent class meetings about visualization design and color. Reflect on the differences of your first attempt and what you obtained after some time spent working on it. If using Gapminder, you can use the country or continent color scheme that ships with Gapminder. Then, make a new graph by converting this visual to a plotly graph. What are some things that plotly makes possible, that are not possible with a regular ggplot2 graph?
# original plot from homework2
gapminder %>% ggplot(aes(gdpPercap, lifeExp)) + scale_x_log10()+
geom_point() +
geom_smooth() +
facet_wrap(~continent, ncol=3)
For this part, I start by cleaning up a plot from homework2 and trying out the theming. First I add axis labels and a title using the labs() function, then I apply the black and white theme using theme_bw(), and finally I use the theme() function to adjust the size of the axis text and to change the background and text color of the facet strips. The points are now colored by continent, and the dollar-formatted labels on the x axis make the units of GDP per capita explicit.
# changing the look of the graphic using theme() layer
(plot5 <- gapminder %>% ggplot(aes(gdpPercap, lifeExp)) +
scale_x_log10(labels= dollar_format())+
geom_point(alpha = 0.3, aes(color = continent))+
geom_smooth() +
facet_wrap(~continent, ncol=3)+
labs(x = "GDP per capita",
y = "Life Expectancy",
title = "Plot for GDP per capita versus life expectancy in the five continents" )+
theme_bw()+
theme(axis.text = element_text(size = 8),
strip.background = element_rect(fill = "green4"),
strip.text = element_text(color = "white")))
I will create another plot and make use of what I have learned in the recent class meetings about visualization design and color to make it more effective.
(plot4 <-
gapminder %>%
# get only the countries of interest
filter(country %in% c("Thailand", "Vietnam"))%>%
ggplot(aes(gdpPercap, lifeExp, shape = country, color = pop))+
# scale the gdpPercap
scale_x_log10(labels = dollar_format())+
geom_point(aes(size = pop))+
scale_size_area()+
scale_color_gradient(low = "#0091ff", high = "#f0650e")+
# add labels and title
labs(x = "GDP per capita",
y = "Life Expectancy",
title = "Plot for GDP per capita versus life expectancy of Thailand and Vietnam" )+
# add theme
theme_bw())
Next, I will convert plot4 (GDP per capita versus life expectancy of Thailand and Vietnam) and plot5 (GDP per capita versus life expectancy in the five continents) into plotly graphs.
In general, plotly provides a toolbar for interacting with the plot. We can zoom in and out and even download the plot directly. Moreover, when hovering near a data point, the plotly plot automatically shows detailed information about that point. In the plot converted from plot5 I found a very useful feature: clicking on a continent in the legend on the right removes all the points from the corresponding panel. This is useful for taking a closer look at the smooth line, or when plots overlap. This finding led me to plotly's highlighting feature, which I try out with plot6.
plotly::ggplotly(plot4)
plotly::ggplotly(plot5)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
g <- crosstalk::SharedData$new(gapminder, ~continent)
plot6 <- ggplot(g, aes(gdpPercap, lifeExp, color = continent, frame = year)) +
geom_point(aes(size = pop, ids = country)) +
geom_smooth(se = FALSE, method = "lm") +
scale_x_log10()
## Warning: Ignoring unknown aesthetics: ids
plotly::ggplotly(plot6) %>%
plotly::highlight("plotly_hover")
## Setting the `off` event (i.e., 'plotly_doubleclick') to match the `on` event (i.e., 'plotly_hover'). You can change this default via the `highlight()` function.
Part 4: Writing figures to file
“But I want to do more!” - Make a deeper exploration of the forcats packages
Reference and Source
Sequential, diverging and qualitative colour scales from colorbrewer.org
Top 50 ggplot2 Visualizations - The Master List (With Full R Code)
http://r-statistics.co/Top50-Ggplot2-Visualizations-MasterList-R-Code.html
gganimate vs. plotly - Which is better at animation?